Red Wine Quality Exploration by Krissada Chalermsook

This report explores a dataset containing the quality ratings and chemical attributes of 1,599 red wines across 13 variables. My objective is to find out which variables influence the quality of red wine.

Univariate Plots Section

First, I run some basic functions (dim, names, summary and str) to get an overview of the data.

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Our dataset consists of 13 variables and 1,599 observations. The most important variable, and the one I would like to focus on, is quality, so let’s plot its distribution using qplot.

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
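The table above can be reproduced with `table()`. The following is a minimal sketch, not the report's original chunk: it rebuilds the quality vector from the printed counts (as a stand-in for `df$quality`) so that it runs without the CSV file, and it uses a base R bar chart instead of qplot.

```r
# Rebuild the quality scores from the counts printed above
# (stand-in for df$quality so the example runs without the CSV file).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Frequency of each score
quality_counts <- table(quality)
print(quality_counts)

# Share of wines rated 5 or 6
mid_share <- sum(quality_counts[c("5", "6")]) / length(quality)

# Bar chart of the distribution (base R; the report itself used qplot)
barplot(quality_counts, xlab = "quality", ylab = "count")
```

Scores 5 and 6 together account for 1,319 of the 1,599 wines, roughly 82%, so the distribution is concentrated in the middle of the scale.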

Next, I would like to know the characteristics of the other variables, so I draw histograms with several bin sizes to see their distributions.

The left graph did not tell me much about the variable. After changing the bin size to 1, I can see that the distribution is right-skewed and that most wines have a fixed acidity between 7 and 10. After that, I transform it using log10.

Transforming the x-axis with log10 makes fixed acidity look approximately normally distributed. Next, I continue with volatile acidity.
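The effect of a log10 transformation on a right-skewed variable can be sketched as follows. The data here are synthetic (a log-normal sample, not the wine data), and `skewness` is a small helper written for this illustration, so the numbers only demonstrate the mechanism.

```r
# Synthetic right-skewed sample (NOT the wine data); log-normal by construction.
set.seed(42)
x <- rlnorm(1599, meanlog = log(8), sdlog = 0.3)

# Sample skewness: mean of the cubed standardised values
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))

# Side-by-side histograms on the raw and the log10 scale
op <- par(mfrow = c(1, 2))
hist(x, breaks = 30, main = "raw", xlab = "x")
hist(log10(x), breaks = 30, main = "log10", xlab = "log10(x)")
par(op)
```

A skewness close to zero after the transformation is one quick numeric check that the histogram has become more symmetric; it does not by itself prove normality.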

For volatile acidity, there are some outliers with values greater than 1.1. I plot it again after trimming them, keeping only values below the 95th percentile.

The graph in the center now looks approximately normal. Next, I continue with citric acid.
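Trimming at the 95th percentile can be done with `quantile()`. This is a sketch on synthetic data (not the wine data); the variable name `volatile` is a placeholder for `df$volatile.acidity`.

```r
# Synthetic sample with a long right tail (NOT the wine data)
set.seed(1)
volatile <- rlnorm(1000, meanlog = log(0.5), sdlog = 0.35)

# 95th percentile cut-off
cutoff <- quantile(volatile, probs = 0.95)

# Keep only the values below the cut-off
trimmed <- volatile[volatile < cutoff]

hist(trimmed, breaks = 30, main = "trimmed at 95th percentile", xlab = "value")
```

Note that this removes the top 5% of values regardless of whether they are true outliers, so the cut-off is a plotting convenience rather than a statistical test.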

Citric acid appears to be right-skewed, so I tried applying a square root transformation as below.

Applying a square root scale to the x-axis makes the citric acid distribution look closer to normal. However, it is obvious that there are many zero values for citric acid. I would like to know how many, so I counted them.

## [1] 132

There are 132 rows that have citric acid = 0, which is quite unusual. I tried trimming out these zero values and plotting the graph again.
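Counting and removing the zeros takes one line each. The sketch below uses only the first ten citric.acid values printed by str() earlier, so it runs on its own; on the full data frame the same comparison, `sum(df$citric.acid == 0)`, is what yields the count reported above.

```r
# First ten citric.acid values as printed by str() above
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Number of exact zeros
n_zero <- sum(citric == 0)

# Drop the zeros before plotting
citric_nonzero <- citric[citric > 0]
```

In this ten-row excerpt, five of the values are zero.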

With a bin size of 1, the histogram shows that around 1,100 wines have residual sugar between 1 and 2, and very few wines have residual sugar between 8 and 16. I also plotted residual sugar on a log scale, and the graph looks more bell-shaped.

I investigated every variable using this method and saw that some distributions are not normal. I plotted those variables again with scale_x_log10 to see the result.

It appears that fixed acidity, volatile acidity, chlorides, total sulfur dioxide and sulphates become approximately normal on the log scale.

Next, in order to find unusual distributions, I created box plots of all variables.
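The points a box plot draws as outliers can also be listed numerically with `boxplot.stats()`, which applies the same 1.5 × IQR whisker rule. The sketch below uses only the first ten residual.sugar values printed by str() earlier, so it illustrates the mechanics rather than the full-data result.

```r
# First ten residual.sugar values as printed by str() above
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# boxplot.stats() applies the same 1.5 * IQR whisker rule that boxplot() draws
stats <- boxplot.stats(sugar)
outliers <- stats$out

boxplot(sugar, horizontal = TRUE, xlab = "residual.sugar")
```

In this excerpt only the value 6.1 lies beyond the upper whisker.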

Reading the variable descriptions, I saw that some variables could be grouped because they have similar characteristics, such as fixed.acidity and volatile.acidity. So I created a new variable, “Total Acid”, by summing fixed acidity, volatile acidity and citric acid, to see whether it shows anything interesting.

Univariate Analysis

What is the structure of your dataset?

  • There are 1599 red wines in the dataset with 13 variables.
  • Besides the row index X, the variables are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality.
  • The main focus is the quality variable, which is scored on a scale from 0 to 10.

Other observations:

  • Some variables have outliers, such as “Residual Sugar”, “Chlorides”, “Free Sulfur Dioxide”, “Total Sulfur Dioxide”, “Sulphates” and “Alcohol”.
  • Some variables are approximately normally distributed, such as “pH” and “Density”.
  • I plotted the non-normal variables on a log scale, and it is interesting that some of them became approximately normal: fixed acidity, volatile acidity, chlorides, total sulfur dioxide and sulphates.
  • The boxplots give a well-organized overview of the distribution of each variable.

What is/are the main feature(s) of interest in your dataset?

  • My main feature is quality. I am interested in which variables have an impact on the quality of red wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

  • First, I tried to understand the meaning of each variable by reading the description at https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt.
  • Then, I plan to focus on the variables whose values vary widely (those that are not normally distributed), which excludes pH and density.
  • Some variables seem to be related, judging from their descriptions, such as pH, fixed acidity, volatile acidity and citric acid.

Did you create any new variables from existing variables in the dataset?

  • I wanted to know what would happen if I made a “Total acid” variable by summing fixed acidity, volatile acidity and citric acid. So I created another variable, total.acid, and plotted it.
library(ggplot2)  # provides qplot()
# Total acid: sum of the three acid measurements
df$total.acid <- df$fixed.acidity + df$volatile.acidity + df$citric.acid
qplot(df$total.acid)

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • I looked for outliers and unusual distributions using the boxplots in the section above.
  • After analysing the result, I found that there are 132 rows with citric acid = 0, and I am not sure whether these are missing values or true zeros.
  • I also found an interesting fact: the quality scores range only from 3 to 8, with no wines rated 0, 1, 2, 9 or 10, which may mean the experts were reluctant to give very low or very high scores.

Bivariate Plots Section

First, I want to see the correlation between every pair of variables in more detail. I use corrplot to do this (with method = number for exact values and square for better visualization).

After looking at the overview of the correlation table, I created the graphs below using jitter plots and boxplots to see how each variable relates to quality.
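The numbers behind a correlation plot come from `cor()`. The sketch below is not the report's original chunk: it builds a small data frame from the first ten rows printed by str() earlier, which is far too few rows for meaningful estimates and only shows the mechanics.

```r
# First ten rows of a few columns, copied from the str() output above.
# Ten rows are far too few for meaningful estimates; this only shows the mechanics.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation matrix, rounded for display
cor_matrix <- round(cor(wine), 2)
print(cor_matrix)

# Correlation of every variable with quality
cor_with_quality <- cor_matrix[, "quality"]
```

With the corrplot package installed, passing the unrounded matrix to `corrplot(cor(wine), method = "number")` should draw it in the style described above.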

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the boxplots and jitter plots, I saw that better wine quality is associated with:

  • lower volatile acidity
  • higher citric acid
  • higher sulphates
  • lower density
  • lower pH
  • higher alcohol

Then, I explored the correlation of each of the above variables with wine quality using ggpairs, which gave the following values.

  • alcohol = 0.476
  • volatile acidity = -0.391
  • sulphates = 0.251
  • citric acid = 0.226
  • density = -0.175
  • pH = -0.0577

It seems that density, and pH even more so, are only weakly related to quality.
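To compare these values at a glance, they can be ranked by absolute size and squared. The sketch below uses the correlations exactly as reported above; the squared value is the share of variance in quality that a simple linear fit on that single variable would explain.

```r
# Correlations with quality as reported above
r_quality <- c(
  alcohol          =  0.476,
  volatile.acidity = -0.391,
  sulphates        =  0.251,
  citric.acid      =  0.226,
  density          = -0.175,
  pH               = -0.0577
)

# Rank by strength, ignoring sign
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]

# Share of variance in quality each variable would explain alone (r squared)
r_squared <- round(ranked^2, 3)
print(r_squared)
```

Even alcohol, the strongest single correlate, explains only about 23% of the variance in quality on its own, while pH explains less than 1%.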

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Using the same correlation plot, I looked for pairs of features with an absolute correlation greater than 0.5, which are:

  • fixed acidity with citric acid, density and pH
  • volatile acidity with citric acid
  • citric acid with pH
  • total sulfur dioxide with free sulfur dioxide
  • total acid with pH, density, fixed acidity and citric acid
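Such pairs can also be extracted programmatically instead of being read off the plot. In the sketch below, `strong_pairs` is a hypothetical helper written for this illustration (it is not part of the report or of any package), and the data are the first ten rows printed by str() earlier, so the output only demonstrates the mechanics.

```r
# Helper (hypothetical, not from the report): list variable pairs whose
# absolute correlation exceeds a threshold.
strong_pairs <- function(data, threshold = 0.5) {
  m <- cor(data)
  # Upper triangle only, so each pair appears once and the diagonal is skipped
  idx <- which(abs(m) > threshold & upper.tri(m), arr.ind = TRUE)
  data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    r    = m[idx],
    stringsAsFactors = FALSE
  )
}

# First ten rows copied from the str() output above (mechanics only)
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

pairs_found <- strong_pairs(wine)
print(pairs_found)
```

Taking the absolute value matters here, because a strong negative relationship is just as informative as a strong positive one.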

What was the strongest relationship you found?

  • The strongest relationship I found is between total acid and fixed acidity, which seems reasonable because fixed acidity is a component of total acid.

Multivariate Plots Section

I looked only at the features that correlate most strongly with quality.

Then I analysed other variables that are not related to the main feature, to look for interesting relationships among them. First, I grouped the data by total acid into 3 classes (low, medium, high).

df$acid.class <- ifelse(df$total.acid < 8, 'low', ifelse(
  df$total.acid < 12, 'medium', 'high'))
df$acid.class <- ordered(df$acid.class,
                     levels = c('low', 'medium', 'high'))
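The same binning can be written with `cut()`, which scales better when there are more than three classes. This is an alternative sketch, not the code used in the report; the `total_acid` values are hypothetical and chosen to include both boundaries.

```r
# Hypothetical total.acid values, including both boundary values 8 and 12
total_acid <- c(5.2, 7.99, 8, 9.5, 11.99, 12, 15.3)

# Nested ifelse(), as in the report
class_ifelse <- ifelse(total_acid < 8, "low",
                       ifelse(total_acid < 12, "medium", "high"))
class_ifelse <- ordered(class_ifelse, levels = c("low", "medium", "high"))

# Same binning with cut(); right = FALSE makes intervals closed on the left,
# so 8 falls in "medium" and 12 in "high", matching the ifelse() version
class_cut <- cut(total_acid,
                 breaks = c(-Inf, 8, 12, Inf),
                 labels = c("low", "medium", "high"),
                 right = FALSE,
                 ordered_result = TRUE)
```

The `right = FALSE` argument is the detail to watch: with the default `right = TRUE`, a value of exactly 8 would be classed as "low" instead of "medium".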

Then I plotted acid class against pH, density, fixed acidity and citric acid, and got the graph below.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  • The first graph shows that citric acid and volatile acidity do not seem to have much impact on quality.
  • It is very clear in graph 2 that higher alcohol and lower volatile acidity go with better wine quality.
  • Graphs 3 and 4 show that citric acid, sulphates and density do not have much impact on wine quality either.

Were there any interesting or surprising interactions between features?

  • It is surprising that, in the visualizations, only alcohol shows a significant impact on wine quality.
  • The graphs of pH, fixed acidity, density and citric acid by acid class are a very good example of visualizing correlated variables.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

  • no

Final Plots and Summary

Plot One

Description One

This boxplot clearly shows that alcohol has an impact on wine quality: higher alcohol tends to go with better quality. However, the outliers show that alcohol alone does not guarantee good quality. I also notice that wines of quality 3 and 4 have more alcohol than wines of quality 5 but are rated lower, which makes it a little harder to predict quality from alcohol alone.
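The comparison read off the boxplot can be checked numerically by computing the median alcohol per quality score. The sketch below is only an illustration of the mechanics: it uses the first ten rows printed by str() earlier, which contain just three of the six quality scores.

```r
# First ten rows copied from the str() output above (mechanics only)
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median alcohol within each quality score
median_alcohol <- tapply(wine$alcohol, wine$quality, median)
print(median_alcohol)

# Boxplot of alcohol grouped by quality (base R formula interface)
boxplot(alcohol ~ quality, data = wine, xlab = "quality", ylab = "alcohol")
```

Running the same `tapply()` call on the full data frame would give one median per score from 3 to 8, which makes any non-monotonic step easy to spot.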

Plot Two

Description Two

This graph shows that pH, density, fixed acidity and citric acid are all related to the amount of acid in the wine.

  • For pH, the higher the pH, the lower the acid class.
  • For fixed acidity, the lower the fixed acidity, the lower the acid class.
  • For density, the lower the density, the lower the acid class.
  • For citric acid, the lower the citric acid, the lower the acid class.

Plot Three

Description Three

This graph takes the two variables most strongly correlated with quality and plots them together with quality, after trimming outliers (alcohol limited to 9-14 and volatile acidity to 0.15-1.2). It shows that lower volatile acidity and higher alcohol tend to go with better wine quality.


Reflection

This project helped me become familiar with data analysis using scatterplots, histograms, boxplots and so on. It was also very interesting to discover facts hidden in the data.

The hardest part of this project was working out how to extract the important information from the data: where to start, and how to make the right decision at each step. More specifically, I decided to focus on quality, which led to finding the important correlations between quality and the other variables.

I also think the variables focus almost entirely on chemical components. It might be better to include some variables that are easier to understand, such as country, year, color and production process, which might reveal more interesting results.